Getting Started

Before running this notebook, select “Session > Restart R and Clear Output” in the menu above to start a new R session. This will clear any old data sets and give us a blank slate to start with.

After starting a new session, run the following code chunk to load the libraries and data that we will be working with today.

French COVID-19 Data

Overview

For our next unit, we will be looking at a collection of spatio-temporal data (that is, data with a time component and spatial component) concerning the ongoing COVID-19 pandemic. We will start by looking at data from France, and then move to data from the United States. Today we will introduce tools for working with spatial data. In the next notebook we will see techniques for working with data containing a date variable.

Here are the three French datasets that we will be working with. They contain spatial, population metadata, and coronavirus numbers at the level of French départements. These are geographic areas that are one of the most important political entities in modern France. There are 101 départements; 96 in mainland Europe and 5 overseas (called the DOM, or Départements d’outre-mer).

dept <- read_sf(file.path("data", "france_departement.geojson"))
pop <- read_csv(file.path("data", "france_departement_population.csv"))
covid <- read_csv(file.path("data", "france_departement_covid.csv"))

The coronavirus data is stored with one row for each combination of day and département. We have the total number of people who died in each day from COVID-19, the total number currently in hospital, the total number currently in reanimation (this is similar to ICU, but not exactly equivalent, so I used the french term here), and the number of newly recovered. Notice that deceased and recovered are the new counts of people who have died or recovered, whereas hospitalised and reanimation are the current total numbers of patients in each group. There are columns indicating the number of new hospitalisations and reanimations, but these have many missing data points.

covid
## # A tibble: 19,998 x 9
##    date       departement departement_name deceased hospitalised reanimation
##    <date>     <chr>       <chr>               <dbl>        <dbl>       <dbl>
##  1 2020-03-18 01          Ain                     0            2           0
##  2 2020-03-18 02          Aisne                   9            0           0
##  3 2020-03-18 03          Allier                  0            0           0
##  4 2020-03-18 04          Alpes-de-Haute-…        0            3           1
##  5 2020-03-18 05          Hautes-Alpes            0            8           1
##  6 2020-03-18 06          Alpes-Maritimes         2           25           1
##  7 2020-03-18 07          Ardèche                 0            0           0
##  8 2020-03-18 08          Ardennes                0            0           0
##  9 2020-03-18 09          Ariège                  0            1           1
## 10 2020-03-18 10          Aube                    0            5           0
## # … with 19,988 more rows, and 3 more variables: recovered <dbl>,
## #   hospitalised_new <dbl>, reanimation_new <dbl>

Note that the either the departement or departement_name can be used as a primary key for the data. You only need one to uniquely describe a location.

Unlike the United States, France collects and publishes very little demographic data about its citizens. One of the few variables we will be able to look at for each département is its population, which is in the following table:

pop
## # A tibble: 101 x 2
##    departement population
##    <chr>            <dbl>
##  1 01              643350
##  2 02              534490
##  3 03              337988
##  4 04              163915
##  5 05              141284
##  6 06             1083310
##  7 07              325712
##  8 08              273579
##  9 09              153153
## 10 10              310020
## # … with 91 more rows

When working with the county-level U.S. data for project 3, you will have more demographic variables to work with.

Working with spatial data

We also have loaded spatial data about each département in France in the form of a simply frames object. The data was loaded from a “geojson” file, a plain-text, open specification for describing spatial data and associated metadata. Printing out the dataset shows that it is not too different from an “ordinary” table of data:

dept
## Simple feature collection with 101 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -61.80958 ymin: -21.38936 xmax: 55.83668 ymax: 51.089
## geographic CRS: WGS 84
## # A tibble: 101 x 3
##    departement departement_name                                      geometry
##    <chr>       <chr>                                       <MULTIPOLYGON [°]>
##  1 01          Ain                 (((4.78021 46.17668, 4.78024 46.18905, 4.…
##  2 02          Aisne               (((3.17296 50.01131, 3.17382 50.01186, 3.…
##  3 03          Allier              (((3.03207 46.79491, 3.03424 46.7908, 3.0…
##  4 04          Alpes-de-Haute-Pro… (((5.67604 44.19143, 5.67817 44.19051, 5.…
##  5 05          Hautes-Alpes        (((6.26057 45.12685, 6.26417 45.12641, 6.…
##  6 06          Alpes-Maritimes     (((7.06711 43.51365, 7.06665 43.5148, 7.0…
##  7 07          Ardèche             (((4.48313 45.23645, 4.4879 45.23218, 4.4…
##  8 08          Ardennes            (((4.23316 49.95775, 4.2369 49.95858, 4.2…
##  9 09          Ariège              (((1.68842 43.27355, 1.69139 43.27173, 1.…
## 10 10          Aube                (((3.41479 48.39027, 3.41555 48.39373, 3.…
## # … with 91 more rows

Like the pop dataset, there is one row for each département. It has sole extra metadata (printed in a different RStudio window) and a special column called geometry. The geometry holds all of the information indicating where the associate geographic area is on a map.

Most plotting, data manipulation, and modeling functions can be used with a spatial data frame just the same way we used plain data frame. For example, we can do a left join with the population data and slice off the first 96 rows (these are the areas that are in Europe).

dept %>%
  left_join(pop, by = "departement") %>%
  slice(1:96)
## Simple feature collection with 96 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -5.14026 ymin: 41.33363 xmax: 9.55996 ymax: 51.089
## geographic CRS: WGS 84
## # A tibble: 96 x 4
##    departement departement_name                           geometry population
##    <chr>       <chr>                            <MULTIPOLYGON [°]>      <dbl>
##  1 01          Ain                (((4.78021 46.17668, 4.78024 46…     643350
##  2 02          Aisne              (((3.17296 50.01131, 3.17382 50…     534490
##  3 03          Allier             (((3.03207 46.79491, 3.03424 46…     337988
##  4 04          Alpes-de-Haute-Pr… (((5.67604 44.19143, 5.67817 44…     163915
##  5 05          Hautes-Alpes       (((6.26057 45.12685, 6.26417 45…     141284
##  6 06          Alpes-Maritimes    (((7.06711 43.51365, 7.06665 43…    1083310
##  7 07          Ardèche            (((4.48313 45.23645, 4.4879 45.…     325712
##  8 08          Ardennes           (((4.23316 49.95775, 4.2369 49.…     273579
##  9 09          Ariège             (((1.68842 43.27355, 1.69139 43…     153153
## 10 10          Aube               (((3.41479 48.39027, 3.41555 48…     310020
## # … with 86 more rows

Notice that the spatial components of the data frame are still present after joining and slicing the data.

Standard ggplot functions work to visualise the non-spatial components of our spatial data. To show the spatial component we need to use a unique kind of geometry called geom_sf. It will plot the shapes in the dataset (by default from the geometry column) over a map. Here is an example of France using our spatial data:

dept %>%
  slice(1:96) %>%
  ggplot() +
    geom_sf()

We can control the way the map looks by adjusting the color (border color), fill (interior color), size (width of the border), and alpha (transparency of the shapes) just as with any other geometry. Here, we will make the borders very small and show the overall population of each département:

dept %>%
  left_join(pop, by = "departement") %>%
  slice(1:96) %>%
  ggplot() +
    geom_sf(aes(fill = population), color = "black", size = 0.1) +
    scale_fill_viridis_c()

You may disagree, but I think that’s a pretty nice map with not too much extra work! While you may be surprised to see this, the largest population is in fact in the Nord département and not Paris (though the latter has a much higher population density).

Two other spatial geometries exist: geom_sf_text and geom_sf_label for adding labels to a plot. For example, here we can name some of the areas:

dept %>%
  left_join(pop, by = "departement") %>%
  slice(1:96) %>%
  ggplot() +
    geom_sf(color = "black", fill = "white", alpha = 0.4, size = 0.1) +
    geom_sf_text(aes(label = departement_name), check_overlap = TRUE, size = 2)
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data

Note that some areas are not labelled because we set check_overlap to TRUE.

Spatial operations

We can also use the spatial information in our dataset to compute metrics about the geometric areas. For example, the function st_area computes the total area of each value in the geometry column (two extra functions are needed to convert the output to a usable number of square-kilometers).

dept %>%
  mutate(area = as.numeric(set_units(st_area(geometry), "km^2")))
## Simple feature collection with 101 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -61.80958 ymin: -21.38936 xmax: 55.83668 ymax: 51.089
## geographic CRS: WGS 84
## # A tibble: 101 x 4
##    departement departement_name                                geometry  area
##  * <chr>       <chr>                                 <MULTIPOLYGON [°]> <dbl>
##  1 01          Ain                (((4.78021 46.17668, 4.78024 46.1890… 5785.
##  2 02          Aisne              (((3.17296 50.01131, 3.17382 50.0118… 7412.
##  3 03          Allier             (((3.03207 46.79491, 3.03424 46.7908… 7379.
##  4 04          Alpes-de-Haute-Pr… (((5.67604 44.19143, 5.67817 44.1905… 6995.
##  5 05          Hautes-Alpes       (((6.26057 45.12685, 6.26417 45.1264… 5691.
##  6 06          Alpes-Maritimes    (((7.06711 43.51365, 7.06665 43.5148… 4294.
##  7 07          Ardèche            (((4.48313 45.23645, 4.4879 45.23218… 5566.
##  8 08          Ardennes           (((4.23316 49.95775, 4.2369 49.95858… 5245.
##  9 09          Ariège             (((1.68842 43.27355, 1.69139 43.2717… 4912.
## 10 10          Aube               (((3.41479 48.39027, 3.41555 48.3937… 6028.
## # … with 91 more rows

From there, we could join to the population data and compute the population density of each département. We can also use the function sm_centroid to compute the lon and lat coordinates of the centroid of each region in our dataset. This is useful for quickly plotting large spatial datasets without needing all the machinery of the spatial geometries (which are great, but can be slow at scale).

dept %>%
  mutate(sm_centroid(geometry))
## Simple feature collection with 101 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -61.80958 ymin: -21.38936 xmax: 55.83668 ymax: 51.089
## geographic CRS: WGS 84
## # A tibble: 101 x 5
##    departement departement_name                          geometry   lon   lat
##  * <chr>       <chr>                           <MULTIPOLYGON [°]> <dbl> <dbl>
##  1 01          Ain               (((4.78021 46.17668, 4.78024 46…  5.35  46.1
##  2 02          Aisne             (((3.17296 50.01131, 3.17382 50…  3.56  49.6
##  3 03          Allier            (((3.03207 46.79491, 3.03424 46…  3.19  46.4
##  4 04          Alpes-de-Haute-P… (((5.67604 44.19143, 5.67817 44…  6.24  44.1
##  5 05          Hautes-Alpes      (((6.26057 45.12685, 6.26417 45…  6.26  44.7
##  6 06          Alpes-Maritimes   (((7.06711 43.51365, 7.06665 43…  7.12  43.9
##  7 07          Ardèche           (((4.48313 45.23645, 4.4879 45.…  4.43  44.8
##  8 08          Ardennes          (((4.23316 49.95775, 4.2369 49.…  4.64  49.6
##  9 09          Ariège            (((1.68842 43.27355, 1.69139 43…  1.50  42.9
## 10 10          Aube              (((3.41479 48.39027, 3.41555 48…  4.16  48.3
## # … with 91 more rows

We can also use the st_transform function to project the coordinates in our dataset into a coordinate system to best plot our data. Each coordinate system uses a numeric code called is EPSG (https://epsg.io/) code; you can look up the best one for the region you are interested in. For example, for Metropolitain France we might use EPSG:3943. This can be done with the following code; note that this create a slightly more accurate map of the data (note that the lines of longitude are no-longer parallel):

dept %>%
  slice(1:96) %>%
  st_transform(3943) %>%
  ggplot() +
    geom_sf()

Likewise, for Guadeloupe, Martinique and Guyane we might use EPSG:2972:

dept %>%
  slice(97:99) %>%
  st_transform(2972) %>%
  ggplot() +
    geom_sf(alpha = 0.1) +
    geom_sf_text(aes(label = departement_name), size = 2)

And for La Réunion and Mayotte we might use EPSG:5879:

dept %>%
  slice(100:101) %>%
  st_transform(5879) %>%
  ggplot() +
    geom_sf(alpha = 0.1) +
    geom_sf_text(aes(label = departement_name), size = 2)

We will see that these projections help particularly when using large regions like the U.S.; they are also particularly useful for projecting data near the North or South Pole.

Practice

Map projections

To get a better sense of how map projections work, slice the dept to include only rows 96-101. This gives one département in Europe and all of the DOM. Plot the data using both geom_sf and geom_sf_text (for a label, use departement_name). Do not project the data, so that it uses just longitude and latitude. Note that Val-d’Oise is in Europe.

dept %>%
  slice(96:101) %>%
  ggplot() +
    geom_sf() +
    geom_sf_text(aes(label = departement_name))
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data

Take the code from above and use the EPSG-3943 projection; this was recommend for use with European France. Notice that the European part of the map has the lat- and lon- lines the closest to parallel to the x- and y-axes.

dept %>%
  st_transform(3943) %>%
  slice(96:101) %>%
  ggplot() +
    geom_sf() +
    geom_sf_text(aes(label = departement_name))

Finally, use the EPSG:5879 projection with the same data:

dept %>%
  st_transform(5879) %>%
  slice(96:101) %>%
  ggplot() +
    geom_sf() +
    geom_sf_text(aes(label = departement_name))

Note how distorted the rest of the plot becomes, but how nice the lines look near Mayotte and La Réunion.

Population Density

Combine the methods in the notes to add a population density variable to the dataset (people per square kilometer). Plot the data spatially using color to show the population density for the first 96 rows of the dataset. Use a viridis color scale appropropriate project.

dept %>%
  mutate(area = as.numeric(set_units(st_area(geometry), "km^2"))) %>%
  left_join(pop, by = "departement") %>%
  mutate(density = population / area) %>%
  slice(1:96) %>%
  st_transform(3943) %>%
  ggplot() +
    geom_sf(aes(fill = density), color = "black", size = 0.1) +
    scale_fill_viridis_c()

You should see that Paris and the the surrounding areas are by far the most dense areas. This is because they are the only départements that include only a dense city area and not any of the surrounding countryside. To fix this we need to be more careful about how we define a color palette.

Usually, we use color only as a secondary element in a plot. However, in a map we often need to use color to show the main feature of interest. This means that we have to be fairly careful about how colors are defined. Change the plot from your previous question to have the following scale:

  • scale_fill_distiller(trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10)
dept %>%
  mutate(area = as.numeric(set_units(st_area(geometry), "km^2"))) %>%
  left_join(pop, by = "departement") %>%
  mutate(density = population / area) %>%
  slice(1:96) %>%
  st_transform(3943) %>%
  ggplot() +
    geom_sf(aes(fill = density), color = "black", size = 0.1) +
    scale_fill_distiller(trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10)

Other options for palettes include:

  • Diverging: BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral
  • Qualitative: Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3
  • Sequential: Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd

Keep this in mind as we continue to work with spatial data, as the defaults will often look terrible and be hard to interpret.

Creating spatial data

It is time to get a little bit more complicated. Let’s read in a dataset of the largest French cities:

cities <- read_csv(file.path("data", "france_cities.csv"))
## Parsed with column specification:
## cols(
##   city = col_character(),
##   lat = col_double(),
##   lon = col_double(),
##   population = col_double(),
##   admin_name = col_character()
## )
cities
## # A tibble: 58 x 5
##    city         lat    lon population admin_name                
##    <chr>      <dbl>  <dbl>      <dbl> <chr>                     
##  1 Paris       48.9  2.33     9904000 Ile-de-France             
##  2 Lyon        45.8  4.83     1423000 Auvergne-Rhone-Alpes      
##  3 Marseille   43.3  5.38     1400000 Provence-Alpes-Cote d'Azur
##  4 Lille       50.6  3.08     1044000 Hauts-de-France           
##  5 Nice        43.7  7.26      927000 Provence-Alpes-Cote d'Azur
##  6 Toulouse    43.6  1.45      847000 Occitanie                 
##  7 Bordeaux    44.8 -0.595     803000 Nouvelle-Aquitaine        
##  8 Rouen       49.4  1.08      532559 Normandie                 
##  9 Strasbourg  48.6  7.75      439972 Grand Est                 
## 10 Nantes      47.2 -1.59      438537 Pays de la Loire          
## # … with 48 more rows

This is just a normal data frame object, but we can convert it into a spatial object with the following code:

cities <- st_as_sf(cities, coords = c("lon", "lat"), remove = FALSE)
st_crs(cities) <- 4326
cities
## Simple feature collection with 58 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -4.495 ymin: 41.9271 xmax: 9.45 ymax: 50.9504
## geographic CRS: WGS 84
## # A tibble: 58 x 6
##    city        lat    lon population admin_name                      geometry
##  * <chr>     <dbl>  <dbl>      <dbl> <chr>                        <POINT [°]>
##  1 Paris      48.9  2.33     9904000 Ile-de-France           (2.3333 48.8667)
##  2 Lyon       45.8  4.83     1423000 Auvergne-Rhone-Alpes        (4.83 45.77)
##  3 Marseille  43.3  5.38     1400000 Provence-Alpes-Cote…       (5.375 43.29)
##  4 Lille      50.6  3.08     1044000 Hauts-de-France             (3.08 50.65)
##  5 Nice       43.7  7.26      927000 Provence-Alpes-Cote…      (7.265 43.715)
##  6 Toulouse   43.6  1.45      847000 Occitanie                 (1.4499 43.62)
##  7 Bordeaux   44.8 -0.595     803000 Nouvelle-Aquitaine        (-0.595 44.85)
##  8 Rouen      49.4  1.08      532559 Normandie                 (1.08 49.4304)
##  9 Strasbou…  48.6  7.75      439972 Grand Est                   (7.75 48.58)
## 10 Nantes     47.2 -1.59      438537 Pays de la Loire         (-1.59 47.2104)
## # … with 48 more rows

Unlike the geometries in dept, each row here is only a single point rather than a polygon. We can do spatial scatterplots without dealing with geometry objects, but need to do this convertion to make projections work correctly.

In the code below, take the last plat you made and add a geom_sf_label layer showing the names and locations of the larger 10 French cities. This helps add context to the plot for anyone not particularly familiar with the locations of the major cities (Paris is already easy to see; the others less-so).

dept %>%
  mutate(area = as.numeric(set_units(st_area(geometry), "km^2"))) %>%
  left_join(pop, by = "departement") %>%
  mutate(density = population / area) %>%
  slice(1:96) %>%
  st_transform(3943) %>%
  ggplot() +
    geom_sf(aes(fill = density), color = "black", size = 0.1, alpha = 0.5) +
    scale_fill_distiller(trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10) +
    geom_sf_label(aes(label = city), data = slice(cities, 1:10))

You can adjust the number of cities based on how large the plot is.

Integrating the COVID-19 data

Finally, let’s integrate a bit of the COVID-19 data into our spatial analysis. Start by creating a dataset called covid_summary that has one row for each département and a column giving the total number of COVID-19 deaths per 1000 residents. This will require using the pop and covid datasets.

covid_summary <- covid %>%
  filter(date == "2020-04-01") %>%
  left_join(pop, by = "departement") %>%
  mutate(morality_rate = deceased / population * 1000)
covid_summary
## # A tibble: 101 x 11
##    date       departement departement_name deceased hospitalised reanimation
##    <date>     <chr>       <chr>               <dbl>        <dbl>       <dbl>
##  1 2020-04-01 01          Ain                    12           92          28
##  2 2020-04-01 02          Aisne                  59          164          40
##  3 2020-04-01 03          Allier                  5           38          13
##  4 2020-04-01 04          Alpes-de-Haute-…        2           22           5
##  5 2020-04-01 05          Hautes-Alpes            1           52          11
##  6 2020-04-01 06          Alpes-Maritimes        29          149          41
##  7 2020-04-01 07          Ardèche                17           85          11
##  8 2020-04-01 08          Ardennes                1           54          16
##  9 2020-04-01 09          Ariège                  0           16           6
## 10 2020-04-01 10          Aube                   18           86          15
## # … with 91 more rows, and 5 more variables: recovered <dbl>,
## #   hospitalised_new <dbl>, reanimation_new <dbl>, population <dbl>,
## #   morality_rate <dbl>

For the 96 départements in Metropolitain France, draw a map showing the number of people who have died from COVID-19 per 1000 residents. Consider using the code you had in the previous plot (with a distiller palette and the top cities shown).

dept %>%
  mutate(area = as.numeric(set_units(st_area(geometry), "km^2"))) %>%
  left_join(covid_summary, by = "departement") %>%
  slice(1:96) %>%
  st_transform(3943) %>%
  ggplot() +
    geom_sf(aes(fill = morality_rate), color = "black", size = 0.1, alpha = 0.5) +
    scale_fill_distiller(trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10) +
    geom_sf_label(aes(label = city), data = slice(cities, 1:10), size = 2) +
    labs(
      x = "", y = "", fill = "Deaths Per 1000 People",
      title = "COVID-19 Mortality in France Métropolitaine",
      subtitle = "As of 10 October 2020",
      caption = "Data Source: Santé publique France and data.gouv.fr"
    ) +
    theme_void()
## Warning: Transformation introduced infinite values in discrete y-axis

As a last step (you will find you can copy much of the previous code), show the number of people who were hospitalised per 100k residents on “2020-10-01”.

covid_now <- covid %>%
  filter(date == "2020-10-01") %>%
  left_join(pop, by = "departement") %>%
  mutate(hosp_rate = hospitalised / population * 100000)

dept %>%
  mutate(area = as.numeric(set_units(st_area(geometry), "km^2"))) %>%
  left_join(covid_now, by = "departement") %>%
  slice(1:96) %>%
  st_transform(3943) %>%
  ggplot() +
    geom_sf(aes(fill = hosp_rate), color = "black", size = 0.1, alpha = 0.5) +
    scale_fill_distiller(trans = "log2", palette = "Spectral", guide = "legend", n.breaks = 10) +
    geom_sf_label(aes(label = city), data = slice(cities, 1:10), size = 2) +
    labs(
      x = "", y = "", fill = "Rate",
      title = "COVID-19 Hospitalization Rate (per 10k residents)",
      subtitle = "On 01 October 2020",
      caption = "Data Source: Santé publique France and data.gouv.fr"
    ) +
    theme_void()

If you have time remaining, consider adding some labels to the last two plots to make them closer to something that you might include a published report. You might consider adding theme_void to minimize the ink used on the axes.